Predict Bike Rentals
Posted on Sun 23 September 2018 in Machine Learning
Predicting Bike Rentals in Washington, D.C.¶
The data set contains 17380 rows, each recording the bike rentals for a single hour.
The goal is to predict the total number of bikes rented in a given hour (the "cnt" column).
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
bike_rentals = pd.read_csv('bike_rental_hour.csv')
print(bike_rentals.head())
bike_rentals["cnt"].hist(bins=50)
plt.show()
bike_rentals["cnt"].hist(bins=50, range=[0, 100])
plt.show()
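# Correlation of every numeric column with the target.
# numeric_only=True skips the "dteday" string column, which recent pandas versions refuse to correlate.
print(bike_rentals.corr(numeric_only=True)["cnt"])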
Calculating Features¶
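Engineered features can improve a model's accuracy by giving it new information. Here the hour of the day ("hr") is bucketed into four periods: morning (1), afternoon (2), evening (3) and night (4).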
def assign_label(row_hour):
    # Bucket the hour of the day into one of four periods.
    if row_hour >= 0 and row_hour < 6:
        return 4  # night
    elif row_hour >= 6 and row_hour < 12:
        return 1  # morning
    elif row_hour >= 12 and row_hour < 18:
        return 2  # afternoon
    elif row_hour >= 18 and row_hour <= 24:
        return 3  # evening
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
bike_rentals["time_label"].head(20)
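The same bucketing can also be written without a row-wise function. The sketch below is only an illustration: it uses `pd.cut` on a made-up series holding every hour from 0 to 23 (not the real data set) and checks that the result agrees with `assign_label`. The names `hours`, `via_apply` and `via_cut` are assumptions introduced for this example.

```python
import pandas as pd

def assign_label(row_hour):
    # Same mapping as in the post: night=4, morning=1, afternoon=2, evening=3
    if row_hour >= 0 and row_hour < 6:
        return 4
    elif row_hour >= 6 and row_hour < 12:
        return 1
    elif row_hour >= 12 and row_hour < 18:
        return 2
    elif row_hour >= 18 and row_hour <= 24:
        return 3

# Hypothetical stand-in for the "hr" column: one row per hour of the day
hours = pd.Series(range(24), name="hr")

via_apply = hours.apply(assign_label)

# right=False closes each bin on the left: [0, 6), [6, 12), [12, 18), [18, 24)
via_cut = pd.cut(
    hours, bins=[0, 6, 12, 18, 24], right=False, labels=[4, 1, 2, 3]
).astype(int)

# True when both approaches assign the same label to every hour
print((via_apply == via_cut).all())
```

`pd.cut` works on the whole column at once, which tends to be faster than `apply` on large frames, but the explicit function is arguably easier to read.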
Train / Test Split¶
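# Randomly pick 80% of the rows for training; random_state makes the split reproducible.
train = bike_rentals.sample(frac=0.8, random_state=1)
# Every row that was not sampled goes to the test set.
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]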
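To see what this split does, here is a minimal sketch on a made-up ten-row frame (the `toy` frame and its values are assumptions for illustration only). `sample(frac=0.8)` draws 80% of the rows without replacement, and the `isin` mask keeps every row that was not drawn, so the two parts do not overlap and together cover the whole frame. This relies on the index being unique.

```python
import pandas as pd

# Hypothetical miniature data set with a default, unique integer index
toy = pd.DataFrame({"hr": range(10),
                    "cnt": [3, 5, 8, 13, 21, 34, 55, 89, 144, 233]})

train = toy.sample(frac=0.8, random_state=1)        # 8 of the 10 rows
test = toy.loc[~toy.index.isin(train.index)]        # the remaining 2 rows

print(len(train), len(test))
# Index labels shared by both parts (should be empty)
print(set(train.index) & set(test.index))
```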
Remove bad features¶
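# Drop the target ("cnt"), the two columns that sum to it ("casual" and
# "registered"), and the raw date string ("dteday").
list_features = list(train.columns)
bad_features = ["cnt", "casual", "registered", "dteday"]
for el in bad_features:
    list_features.remove(el)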
Linear Regression for predicting bike rentals¶
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(train[list_features], train["cnt"])
predictions = model.predict(test[list_features])
print(np.mean((predictions - test["cnt"]) ** 2))
The error is very high, which may be because the data contains a few extremely high rental counts but mostly low ones. The high counts behave like outliers because there are so few of them. MSE penalizes large errors more heavily than small ones, which inflates the total error.
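A small worked example makes the effect of squaring concrete. The numbers below are made up for illustration (three quiet hours and one very busy hour) and are not taken from the data set; the mean absolute error is shown only as a point of comparison.

```python
import numpy as np

# Hypothetical hourly counts: three quiet hours and one very busy hour
actual = np.array([10, 12, 8, 500])
predicted = np.array([11, 10, 9, 300])

errors = predicted - actual          # [1, -2, 1, -200]
squared = errors ** 2                # [1, 4, 1, 40000]

mse = np.mean(squared)               # mean squared error
mae = np.mean(np.abs(errors))        # mean absolute error, for comparison

# Fraction of the total squared error that comes from the single busy hour
outlier_share = squared[-1] / squared.sum()

print(mse, mae, outlier_share)
```

Here the MSE is 10001.5 while the mean absolute error is 51.0, and the single badly predicted busy hour accounts for more than 99% of the total squared error.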
Decision Trees¶
from sklearn.tree import DecisionTreeRegressor
# First attempt: a fully grown tree with default settings.
model_tree = DecisionTreeRegressor()
model_tree.fit(train[list_features], train["cnt"])
predictions = model_tree.predict(test[list_features])
print(np.mean((predictions - test["cnt"]) ** 2))
# Second attempt: require at least 5 samples per leaf to limit overfitting.
model_tree = DecisionTreeRegressor(min_samples_leaf=5)
model_tree.fit(train[list_features], train["cnt"])
predictions = model_tree.predict(test[list_features])
print(np.mean((predictions - test["cnt"]) ** 2))
A nonlinear predictor does much better here: the decision tree's error is lower than that of linear regression.
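The `min_samples_leaf` parameter controls how closely a tree can follow its training data. The sketch below is a rough illustration on synthetic data (a noisy sine curve over a hypothetical "hour" feature, not the bike data): as `min_samples_leaf` grows, the tree ends up with fewer leaves and fits the training set less tightly. The dictionary `results` and the chosen values 1, 5 and 50 are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature in [0, 24) and a noisy periodic target
X = rng.uniform(0, 24, size=(500, 1))
y = 100 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 20, size=500)

results = {}
for leaf in (1, 5, 50):
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=1)
    tree.fit(X, y)
    # Error on the data the tree was trained on, not on held-out data
    train_mse = np.mean((tree.predict(X) - y) ** 2)
    results[leaf] = (tree.get_n_leaves(), train_mse)
    print(leaf, tree.get_n_leaves(), round(train_mse, 2))
```

A very low training error with `min_samples_leaf=1` mostly reflects memorized noise, which is why the held-out test error is the number to compare when choosing this parameter.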
Random Forests: Improving the Decision Tree Prediction¶
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=1, min_samples_leaf=2)
rf.fit(train[list_features], train["cnt"])
predictions = rf.predict(test[list_features])
print(np.mean((predictions - test["cnt"]) ** 2))
The random forest has a lower error than a single decision tree because averaging the predictions of many trees, each fit on a different bootstrap sample, reduces overfitting.
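The averaging idea can be sketched on synthetic data. The example below does not use the bike data: it generates a noisy regression problem with scikit-learn's `make_friedman1`, then compares one fully grown tree with a forest of 50 trees on a held-out split. The sample size, noise level and number of trees are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: 10 features and a noisy nonlinear target
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_train, y_train)

tree_mse = np.mean((tree.predict(X_test) - y_test) ** 2)
forest_mse = np.mean((forest.predict(X_test) - y_test) ** 2)
print(tree_mse, forest_mse)

# The forest's prediction is the mean of its individual trees' predictions
per_tree = np.stack([est.predict(X_test) for est in forest.estimators_])
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X_test)))
```

Each tree in the forest overfits its own bootstrap sample in a slightly different way, so the errors partly cancel when the predictions are averaged.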